Red Wine Quality Data Analysis

Udacity Data Analyst Nanodegree

P4: Explore and Summarize Data

by D. Satas

October 2016


About the Dataset

This dataset is publicly available for research. The details are described in [Cortez et al., 2009]: P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties. In Decision Support Systems, Elsevier, 47(4):547-553. ISSN: 0167-9236.

Available at:

  1. Elsevier

  2. bib

Objective of the Analysis

The goal is to predict the quality rating that tasters assign to a red wine from its measured properties, to guide grape growers and wine producers on what drives wine quality. Do some of these properties have a significant effect on quality? If so, which ones?

Data Overview

Variable description:

Input Variables:

  1. fixed acidity (tartaric acid - g/dm^3): most acids involved with wine are fixed or nonvolatile (do not evaporate readily)

  2. volatile acidity (acetic acid - g/dm^3): the amount of acetic acid in wine, which at too high of levels can lead to an unpleasant, vinegar taste

  3. citric acid (g/dm^3): found in small quantities, citric acid can add “freshness” and flavor to wines

  4. residual sugar (g/dm^3): the amount of sugar remaining after fermentation stops; it’s rare to find wines with less than 1 gram/liter, and wines with greater than 45 grams/liter are considered sweet

  5. chlorides (sodium chloride - g/dm^3): the amount of salt in the wine

  6. free sulfur dioxide (mg/dm^3): the free form of SO2 exists in equilibrium between molecular SO2 (as a dissolved gas) and bisulfite ion; it prevents microbial growth and the oxidation of wine

  7. total sulfur dioxide (mg/dm^3): amount of free and bound forms of SO2; in low concentrations, SO2 is mostly undetectable in wine, but at free SO2 concentrations over 50 ppm, SO2 becomes evident in the nose and taste of wine

  8. density (g/cm^3): the density of wine is close to that of water, depending on the percent alcohol and sugar content

  9. pH (scale between 0 and 14): describes how acidic or basic a wine is on a scale from 0 (very acidic) to 14 (very basic); most wines are between 3-4 on the pH scale

  10. sulphates (potassium sulphate - g/dm^3): a wine additive which can contribute to sulfur dioxide gas (SO2) levels, which acts as an antimicrobial and antioxidant

  11. alcohol (% by volume): the percent alcohol content of the wine

Output Variable (based on sensory data):

  1. quality (score between 0 and 10)

Dataset modifications:

The dataframe is replaced by a subset of itself with the following modifications:

  1. Added a column \(quality.f\) with the quality values as a factor type.

  2. Removed the first column, a row ID, since it adds no value to the analysis.
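The two modifications above can be sketched as follows. This is a minimal illustration on a toy data frame, not the code used for the report; the ID column name `X` and the object name `raw` are assumptions.

```r
# Toy stand-in for the raw file; the ID column name `X` is an assumption.
raw <- data.frame(
  X       = 1:4,
  alcohol = c(9.4, 9.8, 9.8, 9.8),
  quality = c(5L, 5L, 5L, 6L)
)

# 1. Add quality as a factor with levels 1..10, as in the structure output.
raw$quality.f <- factor(raw$quality, levels = 1:10)

# 2. Drop the row ID column by keeping every column except `X`.
redwine <- raw[, names(raw) != "X"]

str(redwine)
```

Declaring all ten levels up front keeps empty categories visible in later tables and bar plots.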

Quick view of the dataset statistics

The structure of the dataframe and the data type of each variable:

## 'data.frame':    1599 obs. of  13 variables:
##  $ fixed.acidity       : num  7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
##  $ volatile.acidity    : num  0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
##  $ citric.acid         : num  0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
##  $ residual.sugar      : num  1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
##  $ chlorides           : num  0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
##  $ free.sulfur.dioxide : num  11 25 15 17 11 13 15 15 9 17 ...
##  $ total.sulfur.dioxide: num  34 67 54 60 34 40 59 21 18 102 ...
##  $ density             : num  0.998 0.997 0.997 0.998 0.998 ...
##  $ pH                  : num  3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
##  $ sulphates           : num  0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
##  $ alcohol             : num  9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
##  $ quality             : int  5 5 5 6 5 5 5 7 7 5 ...
##  $ quality.f           : Factor w/ 10 levels "1","2","3","4",..: 5 5 5 6 5 5 5 7 7 5 ...

The dataset consists of 1599 observations of 13 variables. The additional variable \(quality.f\), created as a factor of the quality scores, will be used to create a model. The variable \(quality\) is of integer type, \(quality.f\) is a factor, and the rest are numeric.


Descriptive statistics of every variable in the dataset.

##  fixed.acidity   volatile.acidity  citric.acid    residual.sugar  
##  Min.   : 4.60   Min.   :0.1200   Min.   :0.000   Min.   : 0.900  
##  1st Qu.: 7.10   1st Qu.:0.3900   1st Qu.:0.090   1st Qu.: 1.900  
##  Median : 7.90   Median :0.5200   Median :0.260   Median : 2.200  
##  Mean   : 8.32   Mean   :0.5278   Mean   :0.271   Mean   : 2.539  
##  3rd Qu.: 9.20   3rd Qu.:0.6400   3rd Qu.:0.420   3rd Qu.: 2.600  
##  Max.   :15.90   Max.   :1.5800   Max.   :1.000   Max.   :15.500  
##                                                                   
##    chlorides       free.sulfur.dioxide total.sulfur.dioxide
##  Min.   :0.01200   Min.   : 1.00       Min.   :  6.00      
##  1st Qu.:0.07000   1st Qu.: 7.00       1st Qu.: 22.00      
##  Median :0.07900   Median :14.00       Median : 38.00      
##  Mean   :0.08747   Mean   :15.87       Mean   : 46.47      
##  3rd Qu.:0.09000   3rd Qu.:21.00       3rd Qu.: 62.00      
##  Max.   :0.61100   Max.   :72.00       Max.   :289.00      
##                                                            
##     density             pH          sulphates         alcohol     
##  Min.   :0.9901   Min.   :2.740   Min.   :0.3300   Min.   : 8.40  
##  1st Qu.:0.9956   1st Qu.:3.210   1st Qu.:0.5500   1st Qu.: 9.50  
##  Median :0.9968   Median :3.310   Median :0.6200   Median :10.20  
##  Mean   :0.9967   Mean   :3.311   Mean   :0.6581   Mean   :10.42  
##  3rd Qu.:0.9978   3rd Qu.:3.400   3rd Qu.:0.7300   3rd Qu.:11.10  
##  Max.   :1.0037   Max.   :4.010   Max.   :2.0000   Max.   :14.90  
##                                                                   
##     quality        quality.f  
##  Min.   :3.000   5      :681  
##  1st Qu.:5.000   6      :638  
##  Median :6.000   7      :199  
##  Mean   :5.636   4      : 53  
##  3rd Qu.:6.000   8      : 18  
##  Max.   :8.000   3      : 10  
##                  (Other):  0

The summary statistics above include the minimum, quartiles, median, mean and maximum of each variable. For most variables the mean is greater than the median, which indicates positive skew, typically driven by high outliers. Only \(density\) and \(pH\) have a median about the same as the mean, which is consistent with a symmetric, roughly normal distribution. The \(quality\) minimum is 3 and the maximum is 8, which might indicate that the dataset doesn’t include any measurements of the worst or best quality wines. The variables \(residual.sugar\), \(chlorides\), \(free.sulfur.dioxide\) and \(total.sulfur.dioxide\) have outliers very far away, since their maximum values are well above the 3rd quartile.
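The comparison of mean against median, and of maximum against 3rd quartile, can be made explicit with a small calculation. The sketch below re-enters the values from the summary table by hand; the column names `rel_gap` and `max_over_q3` are made up for this illustration.

```r
# Values copied from the summary output above (input variables only).
stats <- data.frame(
  variable = c("fixed.acidity", "volatile.acidity", "citric.acid",
               "residual.sugar", "chlorides", "free.sulfur.dioxide",
               "total.sulfur.dioxide", "density", "pH", "sulphates", "alcohol"),
  median = c(7.90, 0.5200, 0.260, 2.200, 0.07900, 14.00, 38.00,
             0.9968, 3.310, 0.6200, 10.20),
  mean   = c(8.32, 0.5278, 0.271, 2.539, 0.08747, 15.87, 46.47,
             0.9967, 3.311, 0.6581, 10.42),
  q3     = c(9.20, 0.6400, 0.420, 2.600, 0.09000, 21.00, 62.00,
             0.9978, 3.400, 0.7300, 11.10),
  max    = c(15.90, 1.5800, 1.000, 15.500, 0.61100, 72.00, 289.00,
             1.0037, 4.010, 2.0000, 14.90),
  stringsAsFactors = FALSE
)

# Relative gap between mean and median: positive values hint at right skew.
stats$rel_gap <- (stats$mean - stats$median) / stats$median

# How far the maximum lies above the 3rd quartile, as a ratio.
stats$max_over_q3 <- stats$max / stats$q3

# Variables with the most extreme upper tails first.
stats[order(-stats$max_over_q3), c("variable", "rel_gap", "max_over_q3")]
```

Sorted this way, the four variables named above come out on top, and \(density\) and \(pH\) show a relative gap close to zero.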


Note: To save space, measurement units are not shown in the plots, charts or graphs that follow. Please refer to the table below if needed.

##                var_name measure_units
## 1         fixed.acidity        g/dm^3
## 2      volatile.acidity        g/dm^3
## 3           citric.acid        g/dm^3
## 4        residual.sugar        g/dm^3
## 5             chlorides        g/dm^3
## 6   free.sulfur.dioxide       mg/dm^3
## 7  total.sulfur.dioxide       mg/dm^3
## 8               density        g/cm^3
## 9                    pH    scale 0-14
## 10            sulphates        g/dm^3
## 11              alcohol             %
## 12              quality    scale 1-10
## 13            quality.f    scale 1-10

Univariate Plots & Analysis Section

The histograms and bar plot below explore the distribution of each variable. I am not sure whether the explanatory variables are completely independent.

**Figure 1.**

As shown in Figure 1, the \(quality\) variable has most values concentrated in categories 5 and 6. Only a small proportion falls in the remaining categories. There are no values in categories 1, 2, 9 and 10. The variables \(residual.sugar\), \(free.sulfur.dioxide\), \(total.sulfur.dioxide\) and \(sulphates\) have positively skewed distributions. \(alcohol\) and \(citric.acid\) have irregularly shaped distributions. \(density\) and \(pH\) appear to be normally distributed.


Boxplots for each of the explanatory variables.

**Figure 2.**

The boxplots in Figure 2 show the distributions of the variables from a different angle. I can see that all variables have outliers. \(free.sulfur.dioxide\) and \(density\) have a few outliers far away from most of the other observations. The variables \(fixed.acidity\), \(volatile.acidity\) and \(citric.acid\) have a lot of outliers. The variables \(alcohol\) and \(citric.acid\) don’t have pronounced outliers. The variables \(quality\), \(density\) and \(pH\) are approximately normally distributed. The distributions of \(sulphates\), \(residual.sugar\) and \(chlorides\) are very heavily skewed.
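To make "outlier" concrete: boxplots conventionally draw as separate points any values beyond 1.5 times the interquartile range from the box. The sketch below applies that rule to the quartiles reported in the summary table. The helper `upper_fence` is hypothetical, and R's boxplots use hinges rather than quartiles, so the plotted whiskers may differ slightly.

```r
# Upper Tukey fence: values above q3 + k * IQR are flagged as outliers.
upper_fence <- function(q1, q3, k = 1.5) {
  q3 + k * (q3 - q1)
}

# Quartiles and maxima taken from the summary output above.
fence_sugar     <- upper_fence(q1 = 1.900, q3 = 2.600)
fence_chlorides <- upper_fence(q1 = 0.070, q3 = 0.090)

# Compare each fence with the observed maximum of the variable.
data.frame(
  variable      = c("residual.sugar", "chlorides"),
  upper_fence   = c(fence_sugar, fence_chlorides),
  observed_max  = c(15.500, 0.611),
  max_is_beyond = c(15.500 > fence_sugar, 0.611 > fence_chlorides)
)
```

For both variables the observed maximum lies several times beyond the fence, which agrees with the heavy skew seen in the plots.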

Bivariate Plots & Analysis Section

Plots and Analysis of Explanatory Variables

To get an overview of the relationships between variables, I produced a pairwise comparison of the explanatory variables of the dataset. The column \(quality.f\) is dropped because it is a factor type variable. The graph provides two different comparisons of each pair of columns and displays the color-encoded correlation coefficient of the respective variables. The legend displays 8 levels of the coefficient from -1 to +1.
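The numbers behind such a pairwise plot are a correlation matrix over the numeric columns. Below is a minimal sketch using only the first ten rows shown in the structure output. Ten rows are far too few to reproduce the report's coefficients, so this only illustrates the mechanics; the name `sample_rows` is an assumption.

```r
# First ten observations of four columns, copied from the structure output.
sample_rows <- data.frame(
  fixed.acidity = c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5),
  citric.acid   = c(0, 0, 0.04, 0.56, 0, 0, 0.06, 0, 0.02, 0.36),
  pH            = c(3.51, 3.2, 3.26, 3.16, 3.51, 3.51, 3.3, 3.39, 3.36, 3.35),
  quality       = c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)
)
sample_rows$quality.f <- factor(sample_rows$quality, levels = 1:10)

# Keep numeric columns only; this drops the factor column quality.f.
numeric_cols <- sample_rows[, sapply(sample_rows, is.numeric)]

# Pairwise Pearson correlations between all remaining columns.
cor_matrix <- cor(numeric_cols)
round(cor_matrix, 2)
```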

**Figure 3.**

The plot in Figure 3 gives a very general idea of the correlations between variables. I picked some pairs with the highest correlation coefficients (the two darkest colors) for a more detailed analysis.


The scatterplots below pair up the more interesting input variables in the dataset. A smoothed conditional mean is added, which helps reveal patterns when points are overplotted.

**Figure 4.**

Total acidity is divided into two groups, namely the volatile acids and the nonvolatile or fixed acids. One of the predominant fixed acids found in wines is citric acid, so it is not a surprise to see a strong correlation between \(fixed.acidity\) and \(citric.acid\).

There is a moderate negative correlation between \(volatile.acidity\) and \(citric.acid\). The disadvantage of adding citric acid is its microbial instability. In the European Union, use of citric acid for acidification is prohibited.


The term “sulfites” is an inclusive term for sulfur dioxide (SO2). SO2 is a preservative and widely used in winemaking because of its antioxidant and antibacterial properties. A small amount of sulfites is produced naturally as a byproduct of fermentation, but most of the SO2 has been added by the winemaker.

**Figure 5.**

Total sulfur dioxide is divided into two groups: free sulfur dioxide and bound sulfur dioxide. So, again, it is clear why \(free.sulfur.dioxide\) and \(total.sulfur.dioxide\) have a strong correlation. See Figure 5.


The measure of the amount of acidity in wine is known as the “titratable acidity” or “total acidity”, which refers to the test that yields the total of all acids present, while strength of acidity is measured according to pH, with most wines having a pH between 2.9 and 3.9.

**Figure 6.**

The plot in Figure 6 shows a strong negative correlation between \(fixed.acidity\) and \(pH\).

Plots and Analysis of Explanatory Variable vs. Response Variable

To get an overview and a better understanding of the relationships between the output variable and the input variables, I produced plots pairing each explanatory variable with the main feature \(quality.f\).

**Figure 7.**

From the plot, it does look like \(fixed.acidity\) and \(quality\) have a slight positive correlation. A small number of wines of average quality (5) have extremely high acidity. The mean for all quality levels is larger than the median, so the \(fixed.acidity\) distribution has a positive skew.


**Figure 9.**

The variable \(volatile.acidity\) might have a fairly even distribution and a moderate negative correlation with quality. There are some larger outliers for wines with quality level 5. The mean for all quality levels is larger than or equal to the median, so the distribution must have a positive skew.

**Figure 10.**

The variable \(citric.acid\) might have a fairly even distribution and a positive correlation with quality. There are 2 observations with very high values among wines with quality level 4.


**Figure 11.**

Looking at the histogram in Figure 1, the variable \(residual.sugar\) seems heavily right-skewed. To better understand the data, the boxplot is produced with a logarithmic transformation applied. The result shows a very slight correlation and a lot of outliers in quality categories 5-7.
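Why a logarithmic scale helps with a right-skewed variable can be seen from the five-number summary of \(residual.sugar\) reported earlier. The sketch below measures how far the maximum sits above the 3rd quartile, in units of the interquartile range, before and after taking logs. The helper `tail_ratio` is hypothetical, and base 10 is an assumption (the ratio itself does not depend on the base).

```r
# Five-number summary of residual.sugar, copied from the summary output.
sugar <- c(min = 0.9, q1 = 1.9, median = 2.2, q3 = 2.6, max = 15.5)

# Distance from the 3rd quartile to the maximum, in interquartile ranges.
tail_ratio <- function(s) {
  unname((s["max"] - s["q3"]) / (s["q3"] - s["q1"]))
}

raw_ratio <- tail_ratio(sugar)         # on the original scale
log_ratio <- tail_ratio(log10(sugar))  # after the log transformation

c(raw = raw_ratio, log10 = log_ratio)
```

The upper tail is still long after the transformation, but it is compressed enough for the boxes themselves to become readable.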


**Figure 12.**

This plot of \(chlorides\) is also produced with a logarithmic transformation applied. The result shows a lot of outliers in quality categories 5-6, with a few outliers at level 7, and a very weak negative correlation.


**Figure 13.**

The mean of \(free.sulfur.dioxide\) for all quality levels is larger than the median, so the distribution must be positively skewed. The correlation with quality is negative and very slight.


**Figure 14.**

To get a better view, the chart is produced with a logarithmic transformation applied. It reveals a weak negative correlation between \(quality\) and \(total.sulfur.dioxide\).


**Figure 15.**

The variable \(density\) has a very small range (0.9901-1.0037), with outliers placed about equally at both ends of the scale. The distribution is approximately normal.


**Figure 16.**

The distribution of \(pH\) appears normal, with very few outliers, mostly located in \(quality\) levels 5-7.


**Figure 17.**

After applying the logarithmic transformation, the plot of \(sulphates\) reveals a lot of outliers among wines of average quality at level 5, and a positive correlation.


**Figure 18.**

From the plot, the correlation between \(alcohol\) and \(quality\) appears positive and comparatively strong. The distribution of alcohol content between levels 5 and 6 is interesting: the 75th percentile of alcohol at level 5 is lower than the median at level 6.


Correlation Tests

The plots and analysis of explanatory variables vs. the response variable revealed some insight into the data. I think it is best to compute both Spearman’s and Pearson’s correlations, since the relation between them might give some information. Spearman’s coefficient is computed on ranks and so depicts monotonic relationships, while Pearson’s is computed on the true values and depicts linear relationships.

  • Test for association between paired samples, using Pearson’s product-moment correlation coefficient.
## # A tibble: 6 × 2
##           cor                                             pair
##         <dbl>                                            <chr>
## 1 -0.68297819             redwine$fixed.acidity and redwine$pH
## 2  0.67170343    redwine$fixed.acidity and redwine$citric.acid
## 3 -0.55249568 redwine$volatile.acidity and redwine$citric.acid
## 4  0.66804729        redwine$fixed.acidity and redwine$density
## 5  0.04207544       redwine$residual.sugar and redwine$alcohol
## 6  0.47616632              redwine$alcohol and redwine$quality

The Pearson product-moment correlation coefficient is a measure of the strength of the linear relationship between two variables. Data shown in the table above are Pearson’s Correlation coefficient and corresponding pair of variables. The numbers support our previous observations about the relationships between picked variables.

  • Spearman Rank Correlation test for association strength between the rankings of two variables.
## # A tibble: 6 × 2
##          rho                                             pair
##        <dbl>                                            <chr>
## 1 -0.7066736             redwine$fixed.acidity and redwine$pH
## 2  0.6617084    redwine$fixed.acidity and redwine$citric.acid
## 3 -0.6102595 redwine$volatile.acidity and redwine$citric.acid
## 4  0.6230708        redwine$fixed.acidity and redwine$density
## 5  0.1165481       redwine$residual.sugar and redwine$alcohol
## 6  0.4785317              redwine$alcohol and redwine$quality

The table above lists the Spearman rho coefficient for each pair of variables. Among the listed pairs, the strongest negative correlation is between \(fixed.acidity\) and \(pH\), and the strongest positive correlation is between \(fixed.acidity\) and \(citric.acid\).
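The difference between the two coefficients is easiest to see on data that is perfectly monotonic but not linear. The sketch below uses synthetic values, not the wine data: the rank-based Spearman coefficient reaches 1, while Pearson’s coefficient stays below it.

```r
# Synthetic example: y increases strictly with x, but along a curve.
x <- 1:20
y <- exp(x / 4)

# cor.test() returns the coefficient in the `estimate` element.
pearson  <- unname(cor.test(x, y, method = "pearson")$estimate)
spearman <- unname(cor.test(x, y, method = "spearman")$estimate)

c(pearson = pearson, spearman = spearman)
```

A Spearman value noticeably larger in magnitude than the Pearson value for the same pair, as for \(volatile.acidity\) and \(citric.acid\) above, is therefore a hint that the relationship is monotonic but not strictly linear.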

Multivariate Plots & Analysis Section

Multinomial Logistic Regression Model

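I set out to use multinomial logistic regression to model the ordinal outcome variable, in which the log odds of the outcomes are modeled as a linear combination of the predictor variables. Note that the call shown below uses glm with family = binomial, which fits a binary logistic regression: for a factor response, R treats the first level present in the data (here quality 3) as one outcome and all other levels as the other. I begin the analysis by including all eleven input variables as main effects; the formula contains no interaction terms.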

## 
## Call:
## glm(formula = quality.f ~ fixed.acidity + volatile.acidity + 
##     citric.acid + residual.sugar + chlorides + free.sulfur.dioxide + 
##     total.sulfur.dioxide + density + pH + sulphates + alcohol, 
##     family = binomial(link = "logit"), data = redwine)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -2.96787   0.00752   0.02341   0.05730   1.22446  
## 
## Coefficients:
##                        Estimate Std. Error z value Pr(>|z|)    
## (Intercept)           494.89510  523.99352   0.944 0.344931    
## fixed.acidity          -0.28791    0.67009  -0.430 0.667446    
## volatile.acidity       -8.40765    2.50988  -3.350 0.000809 ***
## citric.acid            -3.70698    3.92708  -0.944 0.345195    
## residual.sugar          0.14205    0.29387   0.483 0.628827    
## chlorides             -13.03262    7.00680  -1.860 0.062886 .  
## free.sulfur.dioxide    -0.15367    0.08888  -1.729 0.083823 .  
## total.sulfur.dioxide    0.09925    0.04981   1.992 0.046322 *  
## density              -470.45027  533.58716  -0.882 0.377953    
## pH                     -8.01302    4.80305  -1.668 0.095253 .  
## sulphates               2.69403    3.47425   0.775 0.438088    
## alcohol                 1.32310    0.77934   1.698 0.089563 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 121.428  on 1598  degrees of freedom
## Residual deviance:  69.165  on 1587  degrees of freedom
## AIC: 93.165
## 
## Number of Fisher Scoring iterations: 10

The regression result table flags the variables most influential on quality by adding significance codes next to the p-values. \(volatile.acidity\) has the lowest p-value, 0.000809, and is marked with three stars (“***”).
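One way to check which outcome the binomial fit above actually models is to reproduce its null deviance from class counts. With a factor response, a binomial `glm` treats the first level present in the data as "failure" and every other level as "success". The sketch below assumes that the first level present is quality 3, with the 10 wines listed in the summary of \(quality.f\).

```r
# Counts taken from the quality.f summary: 10 wines at level 3, 1599 in total.
n_total <- 1599
n_ref   <- 10   # assumed reference class: quality == 3

# Intercept-only model: the fitted probability is the observed share.
p_hat <- (n_total - n_ref) / n_total

# Null deviance of a binary logistic model = -2 * log-likelihood.
null_deviance <- -2 * ((n_total - n_ref) * log(p_hat) +
                         n_ref * log(1 - p_hat))

round(null_deviance, 3)  # 121.428, the null deviance reported above
```

The match suggests that the coefficients describe the odds of a wine being rated above 3, rather than a model over all quality levels.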


To select a subset of the predictor variables, I performed stepwise variable selection. This is one of the available options to confirm the previous findings.

## 
## Call:
## glm(formula = quality.f ~ volatile.acidity + citric.acid + free.sulfur.dioxide + 
##     total.sulfur.dioxide + density + pH + alcohol, family = binomial(link = "logit"), 
##     data = redwine)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.2580   0.0076   0.0244   0.0607   1.1836  
## 
## Coefficients:
##                        Estimate Std. Error z value Pr(>|z|)    
## (Intercept)           474.57117  267.88706   1.772   0.0765 .  
## volatile.acidity       -9.64610    2.08812  -4.620 3.85e-06 ***
## citric.acid            -5.89262    3.00918  -1.958   0.0502 .  
## free.sulfur.dioxide    -0.14989    0.08004  -1.873   0.0611 .  
## total.sulfur.dioxide    0.10963    0.04756   2.305   0.0212 *  
## density              -458.93682  266.23979  -1.724   0.0847 .  
## pH                     -6.41360    3.57637  -1.793   0.0729 .  
## alcohol                 1.59324    0.67825   2.349   0.0188 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 121.428  on 1598  degrees of freedom
## Residual deviance:  72.299  on 1591  degrees of freedom
## AIC: 88.299
## 
## Number of Fisher Scoring iterations: 10

The selected variables, p-values and significance codes vary slightly from the full model results, but they confirm the general trend. Stepwise selection dropped 4 of the 11 input variables (\(fixed.acidity\), \(residual.sugar\), \(chlorides\) and \(sulphates\)); of the 7 retained, 3 are statistically significant at the 0.05 level.
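The two fits can be compared through their AIC values, which for these binary models equal the residual deviance plus twice the number of estimated coefficients. The sketch below reproduces the reported figures. It assumes the stepwise selection was AIC-based, as R's `step()` is by default; the helper name `aic_from_deviance` is made up.

```r
# AIC for a binary logistic model: residual deviance + 2 * number of coefficients.
aic_from_deviance <- function(residual_deviance, n_coef) {
  residual_deviance + 2 * n_coef
}

# Full model: intercept + 11 predictors; reduced model: intercept + 7 predictors.
aic_full    <- aic_from_deviance(69.165, n_coef = 12)
aic_reduced <- aic_from_deviance(72.299, n_coef = 8)

c(full = aic_full, reduced = aic_reduced)
```

The reduced model fits slightly worse in terms of deviance but has the lower AIC, because the penalty saved by dropping four coefficients outweighs the loss of fit.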

Of the statistically significant variables \(total.sulfur.dioxide\), \(alcohol\) and \(volatile.acidity\), the last has the lowest p-value, suggesting a strong association with the probability of a wine having higher quality. The negative coefficient for this predictor suggests that, all other variables being equal, a wine with more \(volatile.acidity\) is less likely to have higher quality.
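Logistic regression coefficients are on the log-odds scale, so exponentiating them gives odds ratios that are easier to read. The sketch below uses the three significant estimates from the stepwise model. The 0.1 g/dm^3 step for \(volatile.acidity\) is an arbitrary choice for illustration, and "success" means whatever outcome the binomial fit codes as success.

```r
# Estimates copied from the stepwise model output above.
coefs <- c(volatile.acidity     = -9.64610,
           total.sulfur.dioxide =  0.10963,
           alcohol              =  1.59324)

# Odds ratio for a one-unit increase in each predictor.
odds_ratio_per_unit <- exp(coefs)

# A full unit of volatile acidity spans most of its observed range,
# so a 0.1 step is easier to interpret.
or_volatile_step <- unname(exp(coefs["volatile.acidity"] * 0.1))

odds_ratio_per_unit
or_volatile_step
```

An odds ratio below 1 means the odds of success fall as the predictor rises, and a ratio above 1 means they increase.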

Multivariate Plot and Analysis

From the variable selection table I can see that \(volatile.acidity\) and \(alcohol\) have the lowest p-values, so in this dataset they might have the largest influence on the final \(quality\) result.

**Figure 19.**

In Figure 19, the plot of \(volatile.acidity\) vs. \(alcohol\) reveals quite clearly the clustering by color-coded quality levels. The lowest quality wines have higher volatile acidity and lower alcohol levels. The highest quality wines have higher alcohol levels and slightly lower volatile acidity.

Final Plots and Summary

Plot One: Distribution of Red Wine Quality

**Figure 20.**

Summary of the \(quality.f\) variable

##   1   2   3   4   5   6   7   8   9  10 
##   0   0  10  53 681 638 199  18   0   0

Description One

As shown in the histogram in Figure 20 and the summary, the \(quality.f\) variable has most values concentrated in categories 5 and 6. Only a small proportion falls in the remaining categories. There are no values in categories 1, 2, 9 and 10. That means the sample of tested wines did not include any very bad or very good wines. This makes me question the credibility of the dataset.

Plot Two: Correlation Between Objective Parameters

**Figure 21.**

Description Two

As shown in Figure 21, the \(free.sulfur.dioxide\) and \(total.sulfur.dioxide\) variables show the strongest correlation among all wine parameters, with a Spearman rank correlation coefficient of 0.789.

From the chart, it does look like there might be a threshold of about 100 for higher quality wines. But I’m not sure the chart shows that low quality wines have higher sulfur dioxide. Most of the low quality wine is clustered in the upper or lower portion of the graph, while high quality wine sits around the mid-left region.

Plot Three: Distribution of Alcohol vs. Volatile Acidity

**Figure 22.**

Description Three

The \(volatile.acidity\) of a wine is one of the best predictors of its quality. The clustering seen in Figure 22 suggests that the \(volatile.acidity\) and \(alcohol\) values might be used to predict the \(quality\) of a red wine. The best quality wines have lower levels of volatile acidity and an alcohol level above 10. Regression lines depict the separation between different quality ratings.


Reflection

Wine chemistry explains the flavor, balance and color of wine. My exploration and analysis of the red wine dataset started with looking for more information on the basics of wine chemistry, the fermentation process, and the additives that help improve the quality of wines. My biggest struggles working on this project were 1) selecting testing methods and predictive models suited to my data type, since regression analysis includes many modeling techniques, and 2) the actual analysis: interpreting and describing the results of the plots. I think the class could present a bigger variety of samples, quizzes or assignments, or maybe one part of a lesson could be dedicated to an overview of all available methods and techniques, and when and with what kind of data each could be used, without going into too much detail.

My conclusion: the tasters’ decisions on wine quality levels are based on their personal tastes. Only very few variables have a strong correlation with the quality of wine. A widely accepted notion in the wine industry is that the balance of taste and chemical ingredients is as follows:

Sweet Taste (sugars + alcohols) <=> Acid Taste (acids) + Bitter Taste (phenols)

Can we draw any conclusion about the relationship between the quality and the chemical compounds in wine, given that we are presented with measurements of only a small portion of them: a handful from the acid group and none from the phenol group?

Also, as the quality levels of our dataset show, the sample of tested wines did not include any very low or very high quality wines. It might mean the sample is not random, which makes me question the analysis and any of my findings, which might well be inaccurate.

I take this analysis as good practice for learning the R language and RStudio, and for deepening my knowledge of statistics.

Resources

http://www.calwineries.com/learn/wine-chemistry/

https://www.rstudio.com/wp-content/uploads/2015/03/rmarkdown-reference.pdf

http://zevross.com/blog/2014/08/04/beautiful-plotting-in-r-a-ggplot2-cheatsheet-3/